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Abstract 

The importance of the q-Gaussian family of distributions lies in its power-law nature, and its close 
association with Gaussian, Cauchy and uniform distributions. This class of distributions arises from 
maximization of a generalized information measure. We use the power-law nature of the q-Gaussian 
distribution to improve upon the smoothing properties of Gaussian and Cauchy distributions. Based 
on this, we propose a Smoothed Functional (SF) scheme for gradient estimation using q-Gaussian 
distribution. Our work extends the class of distributions that can be used in SF algorithms by 
including the g-Gaussian distributions, which encompass the above three distributions as special 
cases. Using the derived gradient estimates, we propose two-timescale algorithms for optimization of 
a stochastic objective function with gradient descent method. We prove that the proposed algorithms 
converge to a local optimum. Performance of the algorithms is shown by simulation results on a 
queuing model. 



1 Introduction 



Shannon (1948) provided the concept of entropy as a measure of information, or more precisely, a measure 



of uncertainty given by probability distributions. Renyi ( 1961 ) was the first to introduce the concept of 



generalized measures of information when he proposed the first well-known generalization of Shannon 
entropy, known as a-entropy or Renyi entropy, based on Kolmogorov-Nagumo averages. Several other 



generalizations have also been studied in the literature (Perez 1968 Daroczy 1970), and have been 



extensively used in physics, communication theory and other disciplines. 

One of the most recently studied generalized information measure is the nonextensive entropy, due 
to |Tsallis| ( |T988| >, defined as 

H q (p) = -'£p(xyin q (p(x)), (1) 
where p is the probability mass function of the discrete random variable on the set X , and the q-logarithm 



is defined as ln g (x) 
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q G M, q 7^ 1. It is called nonextensive because of its pseudo-additive 



nature (Tsallis 1988 1. Although this generalization had been introduced earlier in (Havrda and Charvat 



1967), Tsallis provided interpretations in the context of statistical mechanics. Suyari (2004) generalized 



the Shannon-Khinchin axioms to the nonextensive case. 

The most important characteristic of these generalized information measures is that, on maximiza- 
tion, they give rise to power-law distributions, while maximization of Shannon entropy gives rise to 
exponential distributions. The concept of maximization of information measure can be attributed to 



Kullback's minimum discrimination theorem ( Kullback 1959 1 , which establishes important connections 
between statistics and information theory. This theorem shows that exponential distributions can be 
obtained by minimizing Kullback-Leibler divergence under given moment constraints. One can consider 



maximum entropy ( Jaynes 1957) as a special case, where maximization of Shannon entropy under mo- 
ment constraints leads to exponential distributions. For example, given mean and variance of a random 
variable, maximum entropy gives rise to Gaussian distribution. 

While exponential distributions have been extensively studied and used in statistical modeling, the 
power-law behavior has been observed in most of the real- world data, e.g., (Barabasi and Albert 1999 



Pareto 1906). The importance of Tsallis entropy can be attributed to its connections with power- 
law distributions, which has been studied in different contexts like finance, earthquakes and network 



traffic (Abe and Suzuki 2003| [2005| . Compared to the exponential family, the Tsallis distributions, 
i.e., the family of distributions resulting from maximization of Tsallis entropy, have an additional shape 
parameter q, similar to that in ([lj that controls the nature of the power-law tails. One of the most 



studied Tsallis entropy maximizer is the q-Gaussian distribution (Prato and Tsallis 19991, which is a 



power-law generalization of Gaussian distribution. In this paper, we study the q-Gaussian distribution 
in the context of smoothed functional algorithms for stochastic optimization. 

Stochastic techniques play a key role in optimization problems, where the objective function does 
not have an analytic expression. Such problems are often encountered in discrete event systems, which 
are quite common in engineering and financial world. Most often, the data, obtained via statistical 
survey or simulation, contains only noisy estimates of the objective function to be optimized. One of 
the most commonly used solution methodologies involves stochastic approximation algorithms, originally 



due to Robbins and Monro (1951), which is used to find the zeros of a given function. Based on this 
approach, gradient descent algorithms have been developed, in which the parameters controlling the 
system track the zeros of the gradient of the objective. However, these algorithms require an estimate 
of the cost gradient. Kiefer and Wolfowitz (1952) provide such a gradient estimate using several parallel 



simulations of the system. More efficient techniques for gradient estimation, have been developed based 
on the smoothed functional approach (Katkovnik and Kulchitsky 1972 Bhatnagar and Borkar| p003[ ), 
simultaneous perturbation stochastic approximation (Spall 1992) etc. A stochastic variation of Newton- 



based optimization methods, also known as adaptive Newton-based schemes has also been studied in 



literature (Ruppert 1985 Spall 2000 Bhatnagar 2007) 



When the above schemes for gradient estimation are employed in optimization methods involving 
long-run average cost objective, the time complexity of the algorithms increase as the long-run average 
cost needs to be estimated after each parameter update. A more efficient approach is to simultane- 
ously perform the long-run averaging and parameter updates using different step-size schedules. These 



classes of algorithms constitute the multi-timescale stochastic approximation algorithms (Bhatnagar and 



Borkar 1998). Two-timescale optimization algorithms have been developed using simultaneous pertur- 
bations (Bhatnagar et al. 2003) and smoothed functional (Bhatnagar 2007) schemes. The main issue 



with such algorithms is that, although convergence of the algorithm to a local optimum is guaranteed, 
the global optimum is often not achieved in practice. Bhatnagar (2007) proves that the gradient SF 



schemes based on Gaussian perturbations converge to a local minimum, and provides comparison of 
the performance of various multi-timescale algorithms for stochastic optimization on a queuing system. 
The results presented there indicate that the performance of the SF algorithms depends considerably on 
several tuning parameters, such as the variance of the Gaussian distribution, and also the step-sizes. 



Summary of our contributions 

The smoothed functional schemes for simulation based optimization have become popular due to their 
smoothing effects on local fluctuations. We derive for the first time, smoothed functional algorithms 
with power-law (q-Gaussian) perturbations. 



Prior work (Rubinstein 1981 Styblinski and Tang 1990) indicated that the class of distributions 



that can be used for the perturbation random variables in SF algorithms includes Gaussian, Cauchy 
and uniform distributions. Our main contribution is to show that the q-Gaussian family of distributions 
belong to this class, and encompasses the above three distributions as special cases. This allows us to 
work with a larger class of distributions in SF algorithms, in an unified way, where the "shape parameter" 
of the q-Gaussian controls its power-law behavior. This parameter also controls the smoothness of the 
convolution, thereby providing additional tuning. 

We show that the multivariate q-Gaussian distribution satisfies all the conditions for smoothing 
kernels discussed in (Rubinstein 1981). We then present estimators for gradient of a function using 



the q-Gaussian smoothing kernel. We also present multi-timescale algorithms for stochastic optimization 
using q-Gaussian based SF that incorporate gradient based search procedures, and prove the convergence 
of the proposed algorithms to the neighborhood of a local optimum. The convergence analysis presented 
in this paper differs from the approaches that have been studied earlier (Bhatnagar 20071. Here 



we 



provide a more straightforward technique using standard results from (|Borkar| |2008| |Kushner and Clark 



Further, we perform simulations on a queuing network to illustrate the benefits of the q-Gaussian 
based SF algorithms compared to their Gaussian counterparts. A shorter version of this paper containing 
only the one-simulation q-Gaussian SF algorithm, and without the convergence proof, has been presented 
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in IEEE 2012 International Symposium on Information Theory ( |Ghoshdastidar et al. 2012). 



The rest of the paper is organized as follows. The framework for the optimization problem and some 
preliminaries on SF and g-Gaussians are presented in Section|2] Section [3] validates the use of g-Gaussian 
as smoothing kernel, and presents gradient descent algorithms using q-Gaussian SF. The convergence 
analysis of the proposed algorithms is discussed in Section |4| Section [5] presents simulations based on 
a numerical setting. Finally, Section [6] provides the concluding remarks. In the appendix, we discuss a 
sampling technique for multivariate q-Gaussians that is used in the proposed algorithms. 

2 Background and Preliminaries 
2.1 Problem Framework 

Let {Y n : n 6 N} C M. d be a parameterized Markov process, depending on a tunable parameter 9 € C, 
where C is a compact and convex subset of R*. Let Pe(x, dy) denote the transition kernel of {Y n } 
when the operative parameter is 9 E C . Let h : M. d i— ► K+ 1J{0} be a Lipschitz continuous cost function 
associated with the process. 

Assumption I. The process {Y n } is ergodic for any given 9 as the operative parameter, i.e., as L — > oo, 

L-l 

^ £ /i(y ro ) E v ,[fc(y)], 

where v§ is the stationary distribution of {Y n }. 

Our objective is to minimize the long-run average cost 

L-l 

J{9) = lim - V h(Y m ) = / h(x)v e {dx), (2) 

L— >oo L J 

by choosing an appropriate 9 G C . The existence of the above limit is assured by Assumption [I] and the 
fact that h is continuous, hence measurable. In addition, we assume that the average cost J{9) satisfies 
the following requirement. 

Assumption II. The function J(.) is continuously differentiable for all 9 £ C . 

Definition 2.1 (Non-anticipative sequence). A random sequence of parameter vectors, (9(n)) n ^o C C, 
controlling a process {Y n } C K d , is said to be non-anticipative if the conditional probability P(Y n+ i G 
dy\ F n ) = Pg(Y n , dy) almost surely for n ^ and all Borel sets dy C M. d , where T n = a(9(m), Y m , m ^ n), 
ft ^ are associated a-fields. 

One can verify that under a non-anticipative parameter sequence (9(n)), the sequence (Y n , 9(n)) n ^o 
is Markov. We assume the existence of a stochastic Lyapunov function. 

Assumption III. Let (9(n)) be a non-anticipative sequence of random parameters controlling the process 
{Yn}, and T n — a(9{m), Y m ,m ^ n), n ^ be a sequence of associated a-fields. There exists e > 0, a 
compact set K. C M. d , and a continuous function V : R d i— » R + 1J{0}> with lim V{x) — oo, such that 

INI-i-oo 

(i) sup E[F(F„) 2 ] < oo, and 

n 

(ii) E[V(Y n+1 )\T n ] ^ V(Y n ) - e , whenever Y n (£JC,n> 0. 



While Assumption nJ\ is a technical requirement, Assumption III ensures that the process under a 



tunable parameter remains stable. Assumption III will not be required, for instance, if, in addition, the 



single-stage cost function h is bounded. It can be seen that the sequence of parameters obtained using 
any of our algorithms below form a non-anticipative sequence. 
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2.2 Smoothed Functionals 



Here, we present the idea behind the smoothed functional approach proposed by K atkovnik and Kulchit-| 
sky (1972). We consider a real-valued function / : C 4 1, defined over a compact set C. Its smoothed 



functional is defined as 



s P [f{e)\ 



G (r l )f(6-r,)dr ] = / Gp[9 - rj)f[rj) d.77, 



(3) 



where Gp : R N H> K is a kernel function, with a parameter /3 taking values from R. The idea behind 
using smoothed functionals is that if f(9) is not well-behaved, i.e., it has a fluctuating character, then 
Sp[f(9)] is "bcttcr-bchavcd" . This can ensure that any optimization algorithm with objective function 
f(9) does not get stuck at a local minimum, but converges to a global minimum. The parameter /3 
controls the degree of smoothness. 



Rubinstein ( 1981 ) established that the SF algorithm achieves these 



properties if the kernel function satisfies the following sufficient conditions: 
(PI) Gp(rj) = JnG where G(x) corresponds to Gp(x) with /? = 1, i.e 



G 



= G X 



77(2) 



(P2) Gp(ri) is piecewise differentiable in 77, 

(P3) Gp(vi) is a probability distribution function, i.e., Sp[f{9)] — E Gj3 (^)[/(6' — 77)], 

(P4) lim^o Gfj{r]) = S(ri), where d(r]) is the Dirac delta function, and 

(P5) lun^ Sp[f(9)]=f(9). 

A two-sided form of SF is defined as 



1 



S' [fm = o / Gp(v)(f(0-v)+f(0 + ri))d V 



Gp(0-v)f(v) dy 



G (v-9)f(r ] ) dr, 



(4) 



The Gaussian distribution satisfies the above conditions, and has been used as a smoothing kernel 



(Katkovnik and Kulchitsky 1972 Styblinski and Tang 1990). The SF approach provides a method for 
estimating the gradient of any function, which satisfies Assumptions [I III (Bhatnagar and Borkar 2003 ). 
Bhatnagar ( 2007 ) uses the Gaussian smoothing, and derives a gradient estimator from ([3j) as 



M-l L-l 



VeJ{9) 



f3ML 



J2 E v(n)h(Y m ) 



(5) 



71—O m— 



for large M, L and small /3. The stochastic process {Y m } is governed by parameter (9(n) + /3rj(n)), 

where 9(n) € C C is obtained through an iterative scheme, and 77(71) = (^(^(n), . . . ,77(^(71)) is a 
TV-dimensional vector of i.i.d. standard Gaussian random variable. Similarly, a two-simulation gradient 
estimator has been suggested using Q , which is of the following form 



V e J(9) 



1 

2PML 



M-l L-l 



E ^2v(n)(h(Y m )-h(Yi)) 



(G) 



n— m— 



for large M, L and small /3, where {Y m } and {Y^} are two processes governed by parameters (9(n) 
(3rj(n)) and (9(n) — (3r](n)), respectively, 9{n) and 77(71) being as before. 
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2.3 Generalized information measure and the g-Gaussian distribution 

A continuous form of the Shannon entropy, also known as differential entropy, has been extensively 
studied in statistical mechanics, probability and statistics. It is defined as 



H(p) = I p(x)h\p{x) dx, 



(7) 



x 



where is a p.d.f. defined on the sample space X . Following the lines of discrete form generalization 



given in (Tsallis 1988), Dukkipati et al. (2007) provides a measure theoretic formulation of continuous 



form Tsallis entropy functional, defined as 



H q ( P ) 



x 



(j>(x)) q dx 



?6l,^l. 



(8) 



This function results when the natural logarithm in Q is replaced by the q-logarithm defined earlier. 
The differential Shannon entropy can be retrieved from pi) as q 1. 



The g-Gaussian distribution was developed to describe the process of Levy super-diffusion (Prato and 



Tsallis 1999 ) , but has been later studied in other fields, such as finance ( Sato 2010 ) and statistics ( Suyari 



2005). Its importance lies in its power-law nature, due to which the tails of the g-Gaussian decay at a 



slower rate than the Gaussian distribution, depending on q. It results from maximizing Tsallis entropy 
under certain 'deformed' moment constraints, known as normalized ^-expectation defined by 

/ f(x)p(x)i dx 

{f)q = fp(x)idx ■ (9) 



This form of an expectation considers an escort distribution p q (x) 



P{x) q 
j s p(x)i dx 



, and has been shown to 



be com patible with the foundations of nonextensive statistics ( Tsallis et al.|[l998[) . Prato and Tsallis 



(1999) maximized Tsallis entropy under the constraints, (x) q = [i q and ((x — fj,) 2 } q = fi q , which are 



known as g-mean and g-variance, respectively. These are generalizations of standard first and second 



moments, and tend to the usual mean and variance, respectively, as q 
distribution^ that has the form 



1. This results in the q-Gaussian 



/3 q K q 



(1-9) 



(3 



for all x € 



(10) 



where, y + = max{y, 0) is called the Tsallis cut-off condition ( Tsallis 1995 ), which ensures that the above 
expression is defined, and K q is the normalizing constant, which is given by 

\l-q 



K q = 



^/7Ty/3 — q 

sfifVS^q r ( 2( g -\) ) 
^ r (^) 



for 



-co < q < 1, 



for 1 < q < 3, 



with r being the Gamma function, which exists over the specified intervals. The function defined in ( 10 ) 
is not integrable for q 3, and hence, q-Gaussian is a probability density function only when q < 3. 



A multivariate form of the g-Gaussian distribution has been discussed in (Umarov and Tsallis 2007 



Vignat and Plastino 2007 1 . Considering the g-mean and q-covariance matrix to be [i q and S 9 respectively, 
the iV-variate g-Gaussian distribution can be expressed as 



K, 



q,N 



1 

^2 



1 - 



(1- 

(A^ + 2 



q) 



Nq) 



for all X e 



where the normalizing constant 

w/ 2 r(^f) 



K, 



q,N 



N+2-Nq 
1-9 



( N+2-Nq 

V i- 1 



for q < 1, 



(11) 



(12) 



^ w/2 r(^i-f) 
r(^r) 



for 1 < q < (l + 



1 The notation G q ,p q is used to maintain consistency with smoothed functional notations, which is used in Section |3j 
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As in the one-dimensional case, the distribution is only defined for q < 1 + can be mapped to the 
multivariate Students'-i distribution. The support set of the above distribution is given by 



n q = 



{xGR^fl-^fl-ii^^} for q<l, 



(13) 



for 1< q < (1 + f ) 



The multivariate normal distribution can be obtained as a special case when q 1, while for q — > — oo, 
we obtain the uniform distribution with an infinitesimally small support around its mean. It is interesting 
to note that for q > 1, these distributions have a one-to-one correspondence with Student's-i distribution, 
and in particular, for q = (1+ jAj), we retrieve the Cauchy distribution. A similar distribution can also 
be obtained by maximizing Rcnyi entropy (Costa et al. 2003). In this paper, we study the multivariate 



q-Gaussian distribution as a smoothing kernel, and develop smoothed functional algorithms based on it. 



3 g-Gaussian based Smoothed Functional Algorithms 

3.1 g-Gaussian as a Smoothing Kernel 

The first step in applying g-Gaussians for SF algorithms is to ensure that the distribution satisfies the 
Rubinstein conditions (properties (P1)-(P5) in Section 2.2 1. The rest of the paper uses the multivariate 
form of g-Gaussian (11), with the q-mean /i q = 0, and q-covariance matrix E 9 = P 2 Iny.n, i.e., the 
components are uncorrelated. 



Proposition 3.1. N -dimensional q- Gaussian distribution (| 1 1 p , with q-covariance (3 InxN satisfies the 

N J 



kernel properties (P1)-(P5) for all q < (l + j^J, q ^ 1. 



Proof. (PI) From (11), it is evident that G q .p(x) = -^G, 



—G - 

(3 N q \p 

(P2) For 1< q < (l + £), G q>p {x) > for all x € R N . Thus, 



11 (N+2-Nq)P 2 W X \ 



For q < 1, (14) holds when x € Q q . On the other hand, when x ^ fl q , we have G q ^{x) = and 
hence, V x G qt p(x) = 0. Thus, G q> p(x) is differentiable for q > 1, and piecewise differentiable for 
g<l. ' 

(P3) G q< p(x) is a distribution for q < (l + and hence, the corresponding SF S q .p(.), parameterized 
by both q and /3, can be written as S q ^[f(9)\ = E G<j l3 ( x )[f{9 — x)]. 

(P4) G q .fi is a distribution satisfying lim G qi p(fS) — oo. So, lim G q .p(x) = S(x). 
(P5) This property trivially holds due to convergence in mean as 



nmS g Af(0)] = J \imG q ,p{x)f{6 - x)dx = \ 5{x)f{6 - x) dx = f{9). 

— oo 

Hence the claim. □ 
From the above result, it follows that q-Gaussian can be used as a kernel function, and hence, given 

a particular value q e ( — oo, l)(J(l,l+^j and some (3 > 0, the one-sided and two-sided SFs of any 

function / : R N n- R are respectively given by 



S q Afm = J G qt p(9-x)f(x)dx, (15) 
n q 



where the nature of the SFs are controlled by both q and f3. 
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3.2 One-simulation g-Gaussian SF Gradient Estimate 

The objective is to estimate the gradient of the average cost WgJ(9) using the SF approach, where 
existence of Vg J (9) follows from AssumptionjllJ The gradient of smoothed functional (smoothed gradient) 
is defined as (Rubinstein 1981) 



where Q q is the support set defined as in (13). As there is no functional relationship between 9 and rj 



dr?W 



over Q q , i.e., ^ ( . 



VfG q , p {B-ri) 



for all 



2(^-0(0) / (i- g )ELi(g (fc) -^ (fc) ) 



where 



p(v) = (i 



(N+2-Nq) IMP 



P N K q , N ^(N + 2-Nq) 

2 (77(0 - 6»(0) 

P*(N + 2-Nq) p («=2) 

Hence, substituting rf — a 



(iV + 2 — 7Vg)/3 2 
Gq,p{9 - v) ! (17) 

o , and using the symmetry of G q ,p(.) 



and /?(.), we can write 



/3(N + 2-Nq) 



G q (v') 



(18) 



In the sequel (Proposition 4.8), wc show that \\V gS q ^[J(9)] — VgJ{8)\\ — > as j3 ~ > 0. Hence, for 



large M and small /3, the form of gradient estimate suggested by (18) is 

M-i 



VgJ(8) 



0(N + 2 - Nq)M 



E 



y{n)J(9 + (3rj(n)) 
p{rj(n)) 



(19) 



where 77(1) , r/(2) , . . . ,77(71) are uncorrelated identically distributed standard g-Gaussian distributed ran- 
dom vectors. Considering that in two-timescale algorithms (discussed later), the value of 9 is updated 
concurrently with the gradient estimation procedure, we estimate Vg J(6(n)) at each stage. By ergodicity 
assumption (Assumption |TJ), we can write (19) as 



V fl J(0(n)) 



pML(N + 2-Nq) 



Af-l L-l 

EE 



T)(n)h(Y w 



m=0 ^1 



(i-g) 

(JV+2-JVg) 



h(n)|| ; 



(20) 



for large i, where the process {i^m} has the same transition kernel as defined in Assumption |TJ except 
that it is governed by parameter (9(n) + /3rj(n)). 

3.3 Two-simulation g-Gaussian SF Gradient Estimate 

In a similar manner, based on Q, the gradient of the two-sided SF can be written as 



V s S' q ^[J{9)} 



V e G p {9 - 77) J{ V ) dT? + - / V e G /3 (7 / - 0)J{rj) dr,. 



(21) 



The first integral can be obtained as in (18). The second integral is evaluated as 



VeG ( v - 0)^) d, = p{N + 2 _ Nq) J ^y<W) J(« - M) *t , 
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where rf = ^—ft-. Thus, we obtain the gradient as a conditional expectation 



V e S' g> p[J(e)] = 
In sequel (Proposition 



1 



/3(N + 2-Nq) 



~-G q (v) 



J{9 + 0ti) - J(6 - fa) 



(22) 



4.10) we show that 



VeS' qi0 [J(d)}-V e J(9) 



to approximate (22 1, for large M, L and small (3, as 
V fl J(0(n)) 1 



as j3 — > 0, which can be used 



w 1 L ' r,(n)(h(Y m )-h(Y^) 



/3ML(N - 



Nq) 



EE 

n — rn—0 



1 - 



(i-g) 



\v(n)\\ 



(23) 



where {Y m } and are governed by (0(n) + /3r)(n)) and (9(n) — f3r/(n)) respectively. 

3.4 Proposed Gradient Descent Algorithms 

We propose two-timescale algorithms based on the estimates obtained in (20) and (23). Let (a(n)) n j>o 
and (b(n)) n ^ be two step-size sequences satisfying the following. 

00 

Assumption IV. (a(n))„^o an d (b(n)) n ^o are two positive step-size sequences satisfying a(n) 2 < 00, 



71=0 



b(n) 2 < 00, a(n) = &(n) = 00 and a(n) — o(b(n)), 



i.e. 



a(n) 



as n — > 00. 



n=0 



n=0 



n=0 



It must be noted that in the algorithms, although M is chosen to be a large quantity (to ensure 
convergence), the quantity L is arbitrarily picked and can be any finite positive number. The averaging 
of the inner summation in (20) and (23) is obtained in our algorithms using two-timescale stochastic 
approximation. In principle, one may select L = 1. However, it is generally observed that a value of L 
typically between 5 and 500 results in better performance (Bhatnagar 20071. Further, the algorithms 
require generation of ./V-dimensional random vectors, consisting of uncorrelated g-Gaussian distributed 
random variates. This method is described in Appendix. 

For 9 = . . . ,0W) T G R N , let V c (9) = (V c (9 {1) ), V C {9 {N) )) T represent the projection of 9 
onto the set C. For simulation, we need to project the perturbed random vectors (9(n) + j3rj(n)) onto 
C using the above projection. The quantities (Z^'(n), i — 1, . . . , N)„^>q are used to estimate VgJ(9) in 
the recursions. 



The Gq-SFl Algorithm 



1 Fix M, L, q and /3; 

2 Set Z (l) (0) =0,i = l,...,N; 

3 Fix the parameter vector 0(0) = (0 (1) (0), 6» (2) (0), . . . , 6 (N) (0)) T ; 

4 for n = to M - 1 do 

5 Generate a random vector n(n) = (rj^(n), n^ 2 \n), . . . , n^ N \n)) T from a standard iV-dimensional 
g-Gaussian distribution; 

6 for m = to L — 1 do 

7 Generate the simulation Y n L+ m governed with parameter Vc(9(n) + /3n(n)); 

8 for i = 1 to N do 

2r,(')(n)fa(y„ Ij + m ) 



Z {i) {nL + m + 1) = (1 - 6(7i))Z (i) (7iL + m) + &(n) 



g(N+2-N q )(i- {N ^-S q) IWWII 2 ) 



10 end 

11 end 

12 for i = 1 to iV do 

13 6 {l) (n + 1) = P c (# W (n) - a(n)Z {l) {nL)\ ; 

14 end 

is Set 6»(n + 1) = (6» (1) (n + 1), 6» (2) (n + 1), . . . , (JV) (n + 1)) T ; 

16 end 

17 Output 6{M) = (6> (1) (M), . . . , 6 {N) (M)) T as the final parameter vector; 
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The Gg-SF2 algorithm is similar to the Gg-SFl algorithm, except that we use two parallel simulations 
YnL+m and Y^ L+m governed with parameters (9(n) + f3r/(n)) and (9(n) — j3r)(n)) respectively, and update 



the gradient estimate, in Step 9, using the single-stage cost function of both simulations as in (23) 



The Gq-SF2 Algorithm 



1 Fix M, L, q and /3; 

2 Set Z«(0) =0,i = l,...,N; 

3 Fix the parameter vector 0(0) = (0 (1) (0), (2) (0), . . . , 6> (JV) (0)) T ; 

4 for n = to M - 1 do 

5 Generate a random vector 77(11) = (n (1 \n), r/ 2 ^(n), . . . , rj'^ (ti)) t from a standard iV-dimensional 
g-Gaussian distribution; 

6 for m = to L — 1 do 

7 Generate two simulations Y n L+ m and F,!_L+ m governed with control parameters Vc(9(n) + /3rj(n)) and 
Vc(0(n) — f3r)(n)) respectively; 

8 for i = l to N do 

>/■> (n)(fe(y„ L+m )-h(y^ L+m )) 



9 Z w (nL + m + 1) = (1 - b(n))Z (i) {nL + m) + b(n) 



10 end 

11 end 

12 for i = 1 to N do 

13 W (n + 1) = (0 {t) (n) - a(n)Z w (nL)) ; 

14 end 

15 Set 0(n + l) = (0 (1) (n + l),e (2) (n + l),...,e (JV) (n + l)) T ; 

16 end 

17 Output 0(M) = (0 (1) (Af), . . . , (JV) (M)) T as the final parameter vector; 



4 Convergence of the proposed Algorithms 



We provide here a more straightforward technique to prove that the algorithms converge to a local 



optimum as compared to (Bhatnagar 2007). Before presenting the details of convergence analysis, we 
present the following result on g-Gaussians. It provides an expression for the moments of iV-variate 



g-Gaussian distributed random vector. This is a consequence of the results presented in (Gradshteyn 



and Ryzhik, 1994). This result plays a key role in the proofs discussed below. 

Proposition 4.1. Suppose X = (X& , X {2 \ . . . , jW) e R N is a random vector, where the components 
are uncorrelated and identically distributed, each being distributed according to a q- Gaussian distribution 
with zero q-mean and unit q-variance, with parameter q 6 ( — 00, l) (J (l, 1 + j|). Also, let p(X) = 



(i-g) 

(N+2-Nq) 



||X|| 2 ) . Then, for any b, b±, &2> • • • } &JV G ^ + U{0}> we have 



(x^) bl (x^) b2 



(p(x)T 



K 







N ■ 



Nq\R 



' N 

n 



bA 



where 



K 



r (T^- b+1 ) r (T^ + 1+ f ) 

» / N 



\l-q\ J Vf=i 2bl (l)7 
if bi is even for all i = 1, 2, . . . , N, 

otherwise, 



if qe (-00, 1), 



if Q e (1, 1 + 



N ) ' 



r(^r+f)r(^r-f) 

exists only if the arguments in the above Gamma functions are positive, which holds for b < ^1 
ifq < I, and - f ) > (£f =1 | - b) if 1< q < (l + |) . 



(24) 



(25) 
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Proof. Since E g = I 



NxN 



and p(X) is non-negative over Q q , we have 



-G q (X) 



K, 



1 - 



(1-9) 



(JV + 2 - iVg) 



da;. 



The second equality in (24) can be easily proved. If for some i = 1, . . . , N, bi is odd, then the above 
function is odd, and its integration is zero over Q q , which is symmetric with respect to any axis by 
definition. For the other cases, since the function is even, the integral is same over every orthant. Hence, 
we may consider the integration over the first orthant, i.e., where each component is positive. For q < 1, 



we can reduce the above integral, using ( |Gradshteyn and Ryzhik 1994 equation (4.635)), to obtain 



-G q {X) 



(p(x)) b 



K q , N T{b) 



N 



Nq 



(26) 



One can observe that the integral in ( 26 ) is in the form of a Beta 



where we set b = fy + J2iLi ^ 
function. Since frj's are even, we can expand T ( o" 1 ) using the expansion of Gamma function of half- 
integers to get r (^y^) = - b /l-\ ~^ 7r ' ^ e c l a i m can b e obtained by substituting -Kg, at from (12 1 and 



using the relation B(m,n) = I pj~q^y- It is easy to verify that all the Gamma functions in the equality 
are positive provided b < (l + jz^j ■ The result for the interval 1 < q < (l + j|) can be proved in a 
similar way (see equations (4.635) and (4.636) of ( Gradshteyn and Ryzhik||1994 |). However, in this case 
the Gamma functions are positive if b, b% , &2j • • • > G ^ + U{0} satisfy the mentioned condition. It may 
be noted here that this is always true for any dimension if b > X) s :=i 2 since g-Gaussians are defined 
only when -Lj- > y. □ 
It is easy to verify the following result in the limiting case of q — > 1. This would help to ensure that the 
convergence analysis done in this paper also holds in the case of Gaussian SF. 

Corollary 4.2. In the limiting case, as q — > 1, 



lim E 



G q (X) 



N 

n e g( x ) 



4.1 Convergence of Gg-SFl Algorithm 

First, let us consider the update along the faster timescale, i.e., Step 9 of the Gg-SFl algorithm. We define 



6{p) = 9 in), fj(p) = rj(n) and b(p) = b(n) for nL < p < (n + 1)L, n ^ 0. It follows from Assumption IV 
that a(p) — o[b(p)), J2 p Kp) — 00 an d J2 p Kp) 2 < 00 • We can rewrite Step 9 of Algorithm 1 as the 
following iteration for all p ^ 



Z{p + 1) = Z(p) + b(p) [g(Y p ) - Zip)] , 



(27) 



where 



g{Y P ) 



2f}(p)HYp) 



Nq)p{v(p)) 



Here, p(.) is defined as in (17l, and {Y p : p G N} is a Markov process parameterized by Vc(9(p) + Pfjip))- 
Let G P = cr(^(fc), fj{k), Yfc, k ^ p) denote the cr-field generated by the mentioned quantities. We can 
observe that (G P ) P ^o is a filtration, where gj Y p ) is ^ -measurable for each p 0. 

We summarize the results presented in (Borkar 2008 Chapter 6, Lemma 3 - Theorem 9) in the 



following theorem. This result leads to the stability and convergence of iteration (27), which runs on the 
faster timescale. 



Theorem 4.3. 

hold: 



Consider the iteration, x p+ i — x p + r y(p)[fix p ,Y p ) + M p ] . Let the following conditions 



10 



1. {Y p : p G N} is a Markov process satisfying Assumptions^ and III, 

2. for each x G R and x p = x for all p £ N, Y p has a unique invariant probability measure V x> 

oo oo 

3. (j{p)) p ^Q are step-sizes satisfying Y 7(p) — 00 an d Yl 1 2 {p) < 00 ; 

4- /(•)•) *s Lipschitz continuous in its first argument uniformly w.r.t the second, 

5. M p is a martingale difference noise term with bounded variance, 

6. if f(x,Ux) — E„ x [/(#, y)l , i/ien i/ie limit f{x{t)) = lira — — - — exists uniformly on com- 

a'foo 

pacts, and 

7. the ODE x(t) = f(x(i)) is well-posed and has the origin as the unique globally asymptotically stable 
equilibrium. 

Then the update x p satisfies sup p ||x p || < oo, almost surely, and converges to the stable fixed points of the 
ordinary differential equation ( ODE) 

= S{x{t),v x ( t ))- 



Rewriting the update (27) as 



z( P + 1) = z{p) + b( P ) [s[ 5 (y p )|e p _i] - z{ P ) + a p ] 



(28) 



where A p — g(Y p ) — E[g(Y p )\G p -i] is ^-measurable. The following result shows that (A p , Q p ) p ^ satisfies 
Condition 5 in Theorem 14.31 



Lemma 4.4. For all values of q G f — oo, l) |J (l, 1 + -^), [A p ,Q p ) p ^ is a martingale difference sequence 
with a bounded variance. 

Proof. It is easy to see that for all p ^ 0, E[A p |(J p _i] = 0. So (A p ,G p ) pe fi is a martingale difference 
sequence. Expanding the terms, we have 

e[|MM| 2 |s p _i] 



(\\fj(p)\\h(Y p )\ 2 




\\\v(p)\\h(Y p ) 




r 




\ p(v(p)) ) 




. p(v(p)) 


Qp-i 




Gp-i 



j5 2 {N + 2- Nq) 2 

Applying conditional Jensen's inequality on the second term, we obtain 



E[\\A p \\ 2 \g p ^} < 



16 



P 2 {N + 2- Nq) 2 



p(v(p)) 



2 h 2 (Y P : 



Qp- 



(29) 



For q G (— oo, 1), we use Holder's inequality to write (29) as 

16 



E^ll 2 1 £?„_!] < 



f3 2 (N + 2-Nq) 2 ™ P y p{ fi {p )) 
16 



(MM 



E[h 2 (Y p )\g p ^] 



l3 2 {l-q)(N + 2- Nq) 



E[h 2 (Y p )\G p ^] 



since, ||?7|| < 



2 , N+2-Nq 



1-9 



and p(rj) ^ 1 for all r\ G f2 9 . By Lipschitz continuity of /i, there exists a\ > 



such that |h(l^)| ^ ai(l + ll^pll) for all p, and hence, by Assumption III we can claim 

E [h(Y p ) 2 \g p ^] < 2a 2 (1 + E [||r p || 2 |£ P -i]) < oo a.s. (30) 

On the other hand, for q G (l, 1 + -^), we apply Cauchy-Schwartz inequality for each of the components 
in ( 29 ) to obtain 

E[|L4 p || 2 |<? P -i] < 



16 



JV 



P 2 (N + 2- Nq)' 



16 



(3 2 {N + 2- Nq) 2 ^ 



N 

z2 E 



p(v{p)) 2 



Qp- 



{v u Hp)Y 
p(v(p)) 4 



4-, 1/2 



E[/i 4 (r p )|g ; 



■p-ij 



I 1/2 
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The second expectation can be shown to be finite as in (301, while we apply Proposition 4.1 to study 
— — — We can observe that in this case, b = 4 and bi = 4 if i = j, otherwise bj = 0. 

□ 



4.1 



ensures that the term is finite, and hence, the claim. 



the existence of E 
and so b > YliLi T - Proposition 

We can write the parameter update (Step 13 of Gg-SFl) as 

00 (n + 1)=V C (fl w (») - 6(n)C(n)) , 

where ((n) = ^0^Z^\nL) = o(l) since a(n) — o(b(n)). Thus, the parameter update recursion can be 
seen to track the ODE 



0(f) = 0. 



(31) 



Hence, the recursion 0(n), n appears quasi-static when viewed from the timescale of (b(n)), and 
hence, in the update |28|), one may let 0{p) = 9 and rj(p) = rj for all peN. Consider the following ODE 



Z(t) 



2 V j(6 + fa) 



Z(t). 



0(N + 2- Nq)p{n) 

Lemma 4.5. The sequence (Z(p)) is uniformly bounded with probability 1. Further, 



(32) 



{P) \P(N + 2-Nq)p(fj( P ))) 







almost surely as p — > oo . 



Proof. It can be easily verified that iteration |28|) satisfies all the conditions of Theorem 4.3 Thus, by 

2n J{6 + (3rj) 



Theorem |4~3| (Z(p)) converges to ODE ([32} as 



2 V h(Y p ) 



We can also see that 



lim 



(3(N + 2-Nq)p(ri) 
2nJ(d + (Sri) 



0(N + 2- Nq)p{r 1 ) 
- aZ(t)) = -Z(t). 



afoo a \0{N + 2 - Nq)p(r]) 

All the conditions in Theorem 14.31 are seen to be verified and the claim follows. 
From Lemma [4. 5| Steps 13 and 15 of Gq-SFl can be written as 



□ 



(n + l)=V c [ 0(n) - o(n) 



2rj{n)j{9{n) + 0rj{n)) 



(3(N + 2-Nq)p(r)(n)) 
V c ( 0(n) + a(n) [-V e(n) J(0(n)) + A(0(n)) + e„] 



where the error in the gradient estimate is given by 

A(0(n)) = V e(n) j(0(n)) - V e{n) S q .p[j(e(n))} 

and the noise term is 

2r](n)j(d{n) + 0n(n)) 



(33) 



(34) 



£n = V e(n) S gi/3 [J(0(n))] 
2 



/3(JV + 2 - TVg) 



: G,(r)) 



0(N + 2-Nq)p{r){n)) 



p(ri(ri)) 



J(0(n) + 0n(n)) 



8(n) 



(35) 



which is a martingale difference term. Let T n — cr(0(O), . . . , 0(n), 77(0), . . . , ?y(n — 1)) denote the cr-field 
generated by the mentioned quantities. We can observe that (^>i)n^o is a filtration, where £o> • ■ ■ j£n-i 
are J-"„-measurable for each n ^ 0. 

We state the following result due Kushner and Clark (1978 Theorem 5.3.1, pp 189-196), adapted to 
our scenario, which leads to the convergence of the updates in (|33|). 



Lemma 4.6. Given the iteration, x n+ i = Vc(x n + l( n ){f( x n) + £,n)), where 



12 



1. Vc represents a projection operator onto a closed and bounded constraint set C , 

2. /(.) is a continuous function, 

3. ("f{n)) n ^o is a positive sequence satisfying j(n) J, 0, X^^Lo l( n ) ~ °°> an( ^ 

4- Xr=o7( n )£n converges a.s. 

Under the above conditions, the update (x n ) converges to the asymptotically stable fixed points of the 
ODE 

x(t)=Vc{f(x(t))), (36) 

where Vc(f(x)) — hm ( - - + ^ — — 



The next result shows that the noise term satisfies the last condition in Lemma 4.6 while the subse- 
quent result proves the error term A(0(n)J is considerably small. 

Lemma 4.7. Let M n = Y^k=o a (k)£,k- Then, for all values of q e ( — oo, l) (J (l, 1 + (M„,7 7 n ) ne N 
is an almost surely convergent martingale sequence. 



Proof. We can easily observe that for all k ^ 0, E | J-jfe] 



P(N + 2 - Nq) 



n(k)j(9(k)+f3 V (k)) 



9(k) 



r)(k)J(9(k) + P7](k)) 



So E^lJ 7 ^] = 0, since 9{k) is ^-measurable, whereas rj(k) is independent of Fk- It follows that 
(Cn,^ 7 n)neN is a martingale difference sequence, and hence (M„, J r „) n6 N is a martingale sequence. Now, 
use of conditional Jensen's inequality leads to 



N 



E[iiaf|^] =E E 



16 



-Fa. 



/3 2 (iV + 2-iVq) 2 ^ 



(^(fc) (j) )' 



j(e(k) + f3 V (k)y 



6{k) 



For any rj € K.^, by definition J(6(k) + prn = E[h(Y p )], where the expectation is with respect to the 
stationary measure. By Jensen's inequality, we can claim J(9{k)+f3n) ^ E [h(Y p ) 2 ] and J(8(k)+/3rj) sC 
E [/i(Yp) 4 ] for all rj £ R N . Using these facts along with arguments similar to Lemma 4.4 it can be seen 
that sup fc E [||£fc|| 2 | J - *] < oo for all k, and hence, if J2 n a ( n ) 2 < °°; 

oo oo oo 

£ E [||M n+1 - M„|| 2 ] =5>(n) 2 E[||6J 2 ] ^^a(n) 2 supE[||e„|| 2 ] < oo a.s. 



n=Q 



n=0 



The claim follows from martingale convergence theorem (Williams 1991 page 111). 



□ 



Proposition 4.8. For a given q < (l + j^) , q ^ 1, and for all 9 £ C , the error term 

V e S q ,p[J(e)}-V e J(9)\\ =o{p). 
Proof. For small j3 > 0, using Taylor series expansion of J(9 + fir)) around 9 € C, 

J{9 + Prj) = J{9) + prfVo J (9) + y^V 2 . + o(/3 2 ). 



So we can write (181 as 



v 9 s q , p [j(e)] = 



(N + 2-Nq)\ p G " in) 



pin) 



~-G q (ri) 



T 1 



p(v) 



ft p 



W V 2 J(% 



p(v) 



VeJ{9) 



(37) 
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We consider each term in (37). The i th component in the first term is Eg 

, N. Similarly, the i th component in the third term can be written as 



= by Proposition 



4.1 



for alH = 1, 



~-G q (v) 



(i) 



N N 



7 j(*) 7 jO') ? j( fc ) 



j=l k=l 

It can be observed that in all cases, each term in the summation is an odd function, and so from 



Proposition 4.1 we can show that the third term in (37 1 is zero. Using a similar argument, we claim that 



the off-diagonal terms in Eq^^ 



which exists for all q G (— oo, 1) (J(l, 1 + N 
this interval. Further, 



are zero, while the diagonal terms are of the form Eq ^ 
1 N as the conditions in Proposition 



4.1 



pxn) 

are always satisfied on 



~-G q (v) 



{N + 2- Nq) 



The claim follows by substituting the above expression in ( 37 1 



Now, we consider the following ODE for the slowest timescale recursion 

' Vc(x+ef(x))-x 



where f> c (f(x)) = lim eJ . ( 
stable points of ( 39 ) lie in the set 



In accordance with Lemma 



4.6 



k = {9ec\r c (-y e J(o)) =o} 



(38) 

□ 

(39) 

it can be observed that the 
(40) 



We have the following key result which shows that iteration (33 1 tracks ODE (391, and hence, the 
convergence of our algorithm is proved. 



Under Assumptions \A 

fio > such that for all (3 G (0,/3q], the sequence (0(n)) obtained from Gq-SFl converges to a point in 



Theorem 4.9 

/?o > such th 

the e-neighborhood of the stable attractor of (39), defined as 

K 

with probability 1 as n — > oo. 



IV, given e > and q G (— oo, 1) (J(l, 1 + jf), there exists 



{x : \\x — xq\\ < e, x G K} 



Proof. It immediately follows from Lemmas 4.6 and 4.7 that the update in (33) converges to the stable 
fixed points of the ODE 

0(t) =V C {- V e J(fi(t)) + A(0(t))) • (41) 



Now starting from the same initial condition, the traject ory o f (|41j) converges to that of (39) uniformly 



over compacts, as A(8(t)) — > 0. Since from Proposition 4.8 we have ||A(0(n))|| = o(/3) for all n, the 
claim follows. It may be noted that we can arrive at the same claim more technically using Hirsch's 
lemma (iHirschl 119891) . □ 



4.2 Convergence of Gg-SF2 Algorithm 

Since the proof of convergence here is along the lines of Gg-SFl, we do not describe it explicitly. We just 
briefly describe the modifications that are required in this case. In the faster timescale, as n — > oo, the 
updates given by Z(nL) track the function 



77(71) 



Nq)p(r)(n)) 



(J(9(n) + Pri(n)) - J(0(n) - £77(71))). 



So we can rewrite the slower timescale update for Gq-SF2 algorithm, in a similar manner as (33), where 



the noise term has two components, due to the two parallel simulations, each being bounded (as in 
Lemma 4.7). We have the following proposition for the error term 



A(9(n)) = VgS' qJj [J(e)}-VeJ(0) 



14 



Proposition 4.10. For a given q < (l 



n) 



q =/= 1, and for all 9 G C, 



VeS' qt p[J(6)]-V e J(9) = o(/3) 



Proof. Using Taylor's expansion, we have for small /3, 



J(0 + prj) - J(9 - prj) = 207? T V fl J(0) + o(/3 2 ). 



One can use similar arguments as in Proposition 4.8 to rewrite (22) as 

2 



V e S^[J(9)] = 



1 



(N + 2- Nq) 



-G t (v) 



p(v) 



V e J(0)+o((3), 



which leads to the claim. □ 
Finally, we have a similar result to prove the convergence of the Gg-SF2 algorithm. 

Theorem 4.11. Under Assumptions^ for e > and q G (— oo, 1) {J(l, 1 + jj), there exists /3q > 
such that for all (3 G (0,/3o] ; the sequence (0(n)) obtained from Gq-SF2 converges to a point in the 
^-neighborhood of the stable attractor of ( |39[ ) ; with probability 1 as n — > oo. 

Theorems |4.9| and |4.11| give the existence of some f3 > for a given e > such that the gradient- 
descent algorithms converge to e-neighborhood of a local minimum. However, these results do not give the 
precise value of j3ty. Further, they do not guarantee that this neighborhood lies within a close proximity 
of a global minimum. 

We make a note on the analysis for Gaussian SF algorithms. Though the above results exclude the 
case q = 1, it is easy to verify that all the claims hold as q J, 1 due to Corollary 4.2 Hence, the above 



convergence analysis provides an alternative to the analysis presented in ( Bhatnagar 
SF algorithms. 



2007) for Gaussian 



5 Simulations using the Proposed Algorithms 
5.1 Numerical Setting 

We consider a multi-node network of M/G/l queues with feedback as shown in the figure below. There 
are K nodes, which are fed with independent Poisson external arrival processes with rates Ai, A2, . . . , Xk , 
respectively. After departing from the i th node, a customer either leaves the system with probability pi 
or enters the (i + l) th node with probability (1 — pi). Once the service at the K th node is completed, the 
customer may rejoin the 1 st node with probability (1 — Pk)- The service time processes of each node, 
{Sn(6i)}n^i,i = 1, 2, . . . ,K are defined as 



i) = Ui(n)[ _ + 110^) -0*11 



(42) 



where for all i = 1, 2, . . . , K, B4 are constants and Ui(n) are independent samples drawn from the uniform 
distribution on (0, 1). The service time of each node depends on the TVi-dimensional tunable parameter 



\ / mm \ / 1 



1,2, 



vector 8i, whose individual components lie in a certain interval 

i = 1, 2, . . . , K. 0i(n) represents the n th update of the parameter vector at the i th node, and 0t represents 
the target parameter vector corresponding to the i th node. 

The cost function is chosen to be the sum of the total waiting times of all the customers in the 
system. For the cost to be minimum, S l n (6i) should be minimum, and hence, we should have 9i(n) = 9i, 



J 


L 




Ul I Ife.^ S 


Ll 1 l<e 



Nodel 



1-P 2 



Node 2 




Figure 1: Queuing Network. 
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I = 1, 



, K. Let us denote 9 = 



IK 



) T and 9 = 



IK 



) T It is evident that 



where N = 2<=i I n order to compare the performance of the various algorithms, we consider the 
performance measure to be the Euclidean distance between 9(n) and 9, 



\9(n) 



K Ni 



0) 



i=l 3 = 1 



1/2 



The choice for such a performance measure is due to the fact that when the above distance is low, the 
queuing network provides globally optimal performance. Hence, in the results presented below, a low 
value of the distance (performance measure) implies that the algorithm converges to a closer proximity 
of the global minimum. 



5.2 Experimental Results 



For the simulations, we first consider a two queue network with the arrival rates at the nodes being 
= 0.2 and A 2 = 0.1 respectively. We consider that all customers leaving node-1 enter node-2, i.e., 
Pi = 0, while customers serviced at node-2 may leave the system with probability p2 — 0.4. We also fix 
the constants in the service times at R\ = 10 and i?2 = 20, respectively, for the two nodes. The service 
time parameters for either node are two-dimensional vectors, N\ — N2 — 2, with components lying in 



the interval 



(^0 . , m 

\ / mm \ / 1 



min 

given by C = [0.1, 0.6] 4 C M 4 . We fix the target parameter at 9 

The simulations were performed on an Intel Core i5 machine with Z.lGiB memory space and Linux 
operating system. We run the algorithms by varying the values of q and (3, while all the other parameters 
are held fixed at M = 10000 and L — 100. For all the cases, the initial parameter is assumed to be 
0(0) = (0.1, 0.1, 0.6, 0.6) T . For each set of parameters, 20 independent runs were performed with each 
run of 10 6 iterations taking about 0.5 seconds. We compare the performance of the proposed algorithms 
with gradient based SF algorithms proposed in (Bhatnagar 2007), which use Gaussian smoothing. Box- 



[0.1,0.6] for i = 1,2, j 



1,2. Thus the constrained space C is 
(0.3,0.3,0.3,0.3) T . 



Miiller method has been used to sample standard Gaussian vectors, while samples from multivariate 
g-Gaussians are drawn using the method discussed in Appendix. This method uses generation of a x 2 - 



distributed random variable, which is implemented using standard methods described in (Kroese et al 



2011 Algorithms 4.33 and 4.37). Figure 2a shows the convergence behavior of the Gaussian and proposed 
q-Gaussian based algorithms with q = 0.8, where the smoothness parameter is f3 = 0.005. 

Before going into a detailed study on the effect of q and smoothing parameter (/?) , we briefly discuss 
the effect of the step-sizes. In the plots shown in Figure 2a the step-sizes (a(n)) n ^o, (b(n)) n ^ were taken 
to be a(n) = ^, b(n) — ^75, respectively, but these can be varied to control the rate of convergence 
of the gradient updates. We briefly address this in Table Fn We fix the step-size sequence a(n) = i, 
and vary the step-size (b(n)) for gradient estimation considering it to be of the form b(n) = In 
order to satisfy Assumption IV we need to consider 7 £ (0.5, 1). We study the performance of Gg-SFl 
and Gg-SF2 with different values of 7. The value of /3 = 0.005 is chosen to be in the range where the 



Gaussian SF algorithms perform well (Bhatnagar 
range (—00, 1 + j?) = (—00, 1.5). It may noted that q 
corresponds to Cauchy, and q — > — 00 gives the uniform distribution 



2007), whereas the values of q are chosen over the 
1 corresponds to the Gaussian case, q = 1.4 




(a) 4-dimensional problem (b) 20-dimensional problem 

Figure 2: Convergence behaviousr of SF and g-SF algorithms for q — 0.8, j3 = 0.005. 
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Table 1: Performance of all algorithm for different values of q with /? = 0.005, and step-sizes a(n) = - 
and b{n) = 



We show the distance of the final updates 0(M) from the optimum 8, averaged over 20 trials. The 
variance of the updates is also shown to indicate the robustness of the algorithms. The results agree with 
previous observations ( Bhatnagar 2007 ) that two-simulation algorithms perform better than their one- 
simulation counterparts. It also shows that in quite a number of cases, q-SF algorithms perform better 
than the Gaussian case. The results for different time-scales, controlled by 7, seem to be comparable. 
So we fix 7 = 0.75 for further analysis. 

We now focus on the effect of the value of q and the smoothing parameter j3 used in the algorithms. 
The results for Gg-SFl and Gq-SF2 algorithms are presented in Tables [2] and [3] For each value of /?, 
the instances where the performance for the q-Gaussian is better than Gaussian-SF are marked in bold. 
Also, the least distance obtained for each j3 is underlined. Similar trends in performance are observed in 
both the tables. 

It can be seen that better results are usually obtained for q > 1 for low values of f3. As (3 increases 
lower values of q give better performance, while the higher q- values result in higher variance. The extreme 
right column indicates a deterioration in the standard error performance, which can be attributed to the 
dependence of the error term on the parameter /3. However, results for q < 1 seem to be less affected 
and hence, relative performance of algorithms with q < 1 are better for higher values of /3. On the other 
hand, the noise term studied in Lemma |4.7| has two effects, observed in the left column and the bottom 
row of either tables. In the lemma, we obtain a upper bound on the variance of the noise £„, which, 
though finite can be arbitrarily large for small j3 or q close to (1 + jr). Thus, the result worsens in these 
cases. In fact, some simulations (not presented here) showed that the performance deteriorated further 
when (3 was made even smaller. The poor performance for small values of g, such as q — —10, can be 
argued in a very simple way. In such a case, the support of the distribution is a very small region around 
its mean at zero. Hence, the perturbations ry are not large enough to achieve sufficient exploration that 
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Table 2: Performance of Gq-SFl algorithm for different values of q and /3. 



17 



9 


0.0005 


0.001 


0.005 


0.01 


0.05 


0.1 


-10 

-5 

-0.5 

0.2 
0.4 
6 
0.8 


0.06036±0. 04502 

A AflCO A _l_ A 1 1 A tZ T 

u.ui * oi__tu.uo ! ±oy 
0.05658±0. 10012 
0.00336±0. 00264 
0.00481±0. 00475 
0.00473±0. 00421 

O.OOlOOztO. 00194 


0.08081±0. 05699 

a noo'70 _L a /ni oc 

n noi i 5-+-n non^7 

U.UZl±o__tU.U.iUD ( 

0.00362±0. 00572 

0.00136±0. 00256 

0.00090±0. 00086 

0.00076±0. 00103 

n nnnfi*?^n nm 1 i 
u.uuud^ nz u.uuiij 

0.00026±0. 00037 


0.06081±0. 05528 

A _l_ A AAC^i? 

U.Ulzzb±U.UUobo 
a nm i finnan 

0.00035±0. 00025 
0.00035±0. 00062 
O.OOOllztO. 00005 
0.00012±0. 00005 

0.00013±0. 00002 


0.02630±0. 00787 

A AAj2CA_I_A AA^IAT 

n nnri/in-J-n nnm ^ 

U . UUU4iU__t U . UUU1D 

0.00019±0. 00008 
0.00019±0. 00010 
0.00016±0. 00008 
0.00026±0. 00007 
u.uuu^o zn u.uuuiu 
0.00023zt0. 00007 


0.02245±0. 00741 

A AArt *">*7_1_ A AAOf A 

a nm nn4-n nnnoo 
0.00104zt0. 00027 
0.00090zt0. 00032 
0.00103zt0. 00048 
0.00091zt0. 00021 
n nm i oA-n nnnfii 

u.UUlliHU.UUlfDl 

0.00142zt0. 00061 


0.02176±0. 00894 
0. 08 1 9 zL 0.00448 

U.UU^XOzlzU.UULIO^t 

0.00165zt0. 00050 
0.00214zt0. 00070 


0.00314zt0. 00124 
0.00327zt0. 00078 

U.UUD1D Z1Z U.UUl^D 

0.01331zb0. 00146 


Gaussian 


0.00483±0. 00531 


0.00160±0. 00142 


0.00030±0. 00013 


0.00026ztO. 00015 


0.00094zt0. 00044 


0.01119zb0. 00168 


1.1 
1.2 
1.3 
Cauchy 
1.49 


0.00102±0. 00178 
0.00120±0. 00154 
0.00316±0. 00670 
0.00314±0. 00398 
0.00424±0. 00226 


0.00033±0. 00048 
0.00018±0. 00019 
0.00074±0. 00061 
0.00145±0. 00135 

0.00600±0. 00431 


0.00014±0. 00003 
0.00016±0. 00003 
0.00018±0. 00004 

0.00062±0. 00030 
0.00494±0. 00228 


0.00026zt0. 00012 
0.00022zt0. 00008 

0.00030±0.00014 
0.00138zt0. 00053 
0.00682zb0. 00230 


0.00139zt0. 00045 
0.00208±0. 00035 
0.00515zt0. 00066 
0.01462zt0. 00060 
0.02771zt0. 00178 


0.04619zb0. 00229 
0.04432zb0. 00286 
0.04825zb0. 00176 
0.05675±0. 00373 
0.06799zb0. 00536 



Table 3: Performance of Gq-SF2 algorithm for different values of q and j5. 



would be required to reach a good optimum. 

Another interesting phenomena is observed that requires some discussion. For (3 in the range 0.001 
to 0.01, it is seen that the results for the Gaussian case (q —> 1) are poor when compared with those for 
q close to 1. This result appears counter- intuitive as the performance of the Gaussian SF "should have 
been" a limiting case of the g-Gaussian SF. Hence, this observation needs a detailed analysis. A closer 
look at the presented algorithms shows that the value of q plays a role only in two steps - generation 
of the perturbations 77, and the gradient update rule. It is easy to verify that the gradient update is 
a continuous function of q at q = 1, and hence, cannot induce the observed effect. Thus, the issue 
lies with the sampling step, which is in turn dependent on the \ 2 variate used in the sampling method 
discussed in Appendix. One can verify that for q close to 1, the parameter of the % 2 random variable, 
say m (used in Step 2 of sampling method), is a large positive quantity, and so implementations involve 
Algorithm 4.33 from ( Kroese et al.| 2011[ ). Assuming m is quite large, it follows from Algorithm 4.33 
that that random variate a ~ X 2 (rn) is close to ?p, and so, roughly Y « \[2Z. On the other hand, for 
the case of q = 1 (Gaussian), Y = Z. Hence, the perturbations are larger for q close to 1 as compared 
to q = 1, which helps the algorithms to avoid getting stuck at any local minimum. This phenomena is 
not observed as /3 increases due to increased smoothing. Here, relative performance of Gaussian SF is 
better, but in these cases, the overall performance of the algorithms becomes worse and hence, no benefit 
is achieved by the Gaussian. 

We perform similar experiments in a higher dimensional case. For this, we consider a four node 
network with = 0.2 for all i = 1, ... ,4. The probability of leaving the system after service at each 
node is pi = 0.2 for all nodes. The service process of each node is controlled by a 5-dimensional parameter 
vector, and a constant set at Ri = 10. Thus, we have a 20-dimensional constrained optimization problem, 
where each component can vary over the interval [0.1,0.6] and the target is 0.3. The parameters of the 
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Table 4: Performance of the Gg-SF2 algorithm for different values of q and /3. 
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algorithms are held fixed at M = 10000, L = 100 and e = 0.1. Each component in the initial parameter 
vector is assumed to be #W(0) = 0.6 for all i = 1, 2, . . . , 20. The step-sizes were taken to be a(n) — - 
and b(n) = 1 85 . For each (q, /3) tuple, 20 independent runs were performed. Each run of 10 6 iterations 
took about 1 second on an average. Figure [2b| shows the convergence behavior of the algorithms in this 
case. The performance of one-simulation algorithm is relatively poor. Table [4] shows the performance of 
Gq-SF2 algorithms for different (q, /3)-pairs. We observe that the trend is similar in this case showing 
that the algorithms scale in a similar manner. 



6 Conclusions 

The q-Gaussian distribution is an important power-law distribution that has connections with generalized 
information measures. The power-law behavior of q-Gaussians provide a better control over smoothing 
of functions as compared to the Gaussian distribution. We have extended the Gaussian smoothed 
functional approach for gradient estimation to the q-Gaussian case by showing that the q-Gaussian 
distribution satisfies the Rubinstein conditions (IRubinstein 1981). Further, we developed optimization 



algorithms that incorporate q-Gaussian smoothing. This extension turns out to be more significant when 
we note that the q-Gaussians encompass all the existing smoothing kernels - Gaussian (q — > 1), Cauchy 
(q = 1 + jyxj) and uniform (q — > — oo). 

We proposed two q-Gaussian SF algorithms for simulation optimization. We use a queuing network 
example to show that for certain values of q, the results provided by the proposed algorithms are signifi- 
cantly better than Gaussian SF algorithms. Our simulation results even indicate that for some q-values, 
the performance is better as compared to all the above mentioned special cases. We also presented proof 
of convergence of the proposed algorithms to a local minimum of the objective function. 

It would be interesting to develop Hessian estimators incorporating the q-Gaussian smoothed func- 
tional, and developing Newton based algorithms along these lines. 



APPENDIX 

Sampling algorithm for multivariate g-Gaussian distribution 

The algorithms discussed in the paper require generation of a multivariate q-Gaussian distributed ran- 
dom vector, whose individual components are uncorrelated and identically distributed. This implies that 
the random variables are q-independent (Umarov and Tsallis 2007). For the limiting case of q — >• 1, 



q-independence is equivalent to independence of the random variables. Hence, we can use standard 
algorithms to generate i.i.d. samples. This is typically not possible for q-Gaussians with q ^ 1 



Thistle- 



ton et al. (20071 proposed an algorithm for generating one-dimensional q-Gaussian distributed random 
variables using generalized Box-Miiller transformation. But, there exists no standard algorithm for 
generating iV-variate q-Gaussian random vectors. 

A method can be obtained by making use of the one-to-one correspondence between q-Gaussian and 
Students'-i distributions for q > 1. Further, a duality property of q-Gaussians can be used to relate the 
distributions for q 6 (1,1+ jf^) and q € (— oo, 1). This observation, first made by Vignat and Plastino 
( 2006 ) , is shown below. We denote the q-Gaussian distribution with q-mean /i q and q-covariance ll q as 
Gqil^q, Based on this, we formally present an algorithm, which can be used to generate multivariate 
q-Gaussian distribution. Theoretical justification behind the algorithm is provided in the following 



results (Vignat and Plastino| 2006) 



X 2 (m), 



Lemma A.l. Given Z ~ A/"(0, Inxn) and a> 
G q (0, Inxn), where q = (l + j^^j ■ 

Lemma A. 2. Let Y ~ Gq{0, Inxn) for some q G (1,1+ jvT2/ an< ^ 



> 0, then the vector Y = 



Z 



X 



2-g 



N+2-Nq 



Y 



g-i 



N+2-Nq 



Y T Y 



Then X <~ G q '(0,lNxN), where q' = ^1 — 



(N+4)-{N+2)q 
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Lemma A. 3. IfY ~ G q {0,lNxN) for some q € ( — oo, l) (J (l, 1 + ^) , i/ien 



Sampling algorithm for multivariate g-Gaussian distribution 



Input: 

a) g 6 ( — oo, 1+ ^1, where g = 1 is assumed to be the Gaussian case 

b) g-mean, fi q € 

c) g-covariance matrix E 9 £ M iVxJV 

1 Generate JV-dimensional standard Ga ussian vector Z ~ A/ "(0, Inxn)- 

2 Generate chi-squared random variate (Kroese et al. 2011 Chapter 4.2.6) 



a ~ < 



X 2 (^) for -oo<g<l, 

^(A^VI) for 1<?<(l+ 2). 



3 Compute 



j for _oo<g<l, 



Y = { Z for g = 1, 

for l<g<(l + A) 



Output: X = + Yjq/ 2 Yj, which is a sample from Q q (fi q ,T, q 
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